TemBERTure: advancing protein thermostability prediction with deep learning and attention mechanisms

Abstract

Motivation: Understanding protein thermostability is essential for numerous biotechnological applications, but traditional experimental methods are time-consuming, expensive, and error-prone. Recently, deep learning (DL) techniques from natural language processing (NLP) have been extended to the field of biology, since the primary sequence of a protein can be viewed as a string of amino acids that follows a physicochemical grammar.

Results: In this study, we developed TemBERTure, a DL framework that predicts thermostability class and melting temperature from protein sequences. Our findings emphasize the importance of data diversity for training robust models, especially by including sequences from a wider range of organisms. Additionally, we suggest using attention scores from DL models to gain deeper insights into protein thermostability. Analyzing these scores in conjunction with the 3D protein structure can enhance understanding of the complex interactions among amino acid properties, their positioning, and the surrounding microenvironment. By addressing the limitations of current prediction methods and introducing new exploration avenues, this research paves the way for more accurate and informative protein thermostability predictions, ultimately accelerating advancements in protein engineering.

Availability and implementation: The TemBERTure model and the data are available at: https://github.com/ibmm-unibe-ch/TemBERTure.


Ensemble Evaluation for Melting Temperature Prediction
To improve the melting temperature prediction of TemBERTure Tm, we evaluated model ensembles on the validation set. These ensembles were constructed by selecting a subset of the initial 18 models, which covered all distinct initialization methods (random and transfer learning with TemBERTure CLS weights) and their duplicates. We explored three ensemble approaches: a greedy algorithm, a weighted ensemble, and a method leveraging TemBERTure CLS. Additionally, we experimented with various averaging techniques (with outlier removal based on standard deviation or interquartile range) to combine predictions and identify the optimal value for each data point. Overall, these ensemble strategies aimed to harness the strengths of multiple models and to perform well across a broad temperature range.

Averaging
The predictions were averaged across all replicas, resulting in an average melting temperature from 18 models per observation.

IQR and standard deviation
The predictions of all models were aggregated. We then identified and removed outliers using either the interquartile range (IQR) or a threshold of 3 standard deviations. The remaining inliers were averaged to give the final prediction.
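The outlier-filtered averaging described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the IQR rule is assumed to use the common 1.5 × IQR fences (the text does not state the multiplier), and the standard deviation is assumed to be the population standard deviation around the mean of the replica predictions.

```python
from statistics import fmean, pstdev, quantiles


def iqr_filtered_mean(values, k=1.5):
    """Average the values that lie within [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is an assumed (conventional) multiplier, not taken from the text.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    return fmean(inliers)


def sd_filtered_mean(values, n_sd=3.0):
    """Average the values within n_sd standard deviations of the mean."""
    mu = fmean(values)
    sd = pstdev(values)
    inliers = [v for v in values if abs(v - mu) <= n_sd * sd]
    return fmean(inliers)


# Hypothetical predictions (in degrees Celsius) from 18 replicas for one
# sequence: 17 replicas agree and one is far off.
replica_preds = [62.0] * 17 + [95.0]

iqr_estimate = iqr_filtered_mean(replica_preds)
sd_estimate = sd_filtered_mean(replica_preds)
```

With only a handful of models, a 3 standard deviation threshold can never flag a value (the largest possible z-score among n values is sqrt(n - 1)), so this rule only becomes active for larger ensembles such as the 18 replicas considered here.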

Greedy search ensemble
The greedy ensemble approach aimed to identify an optimal combination of models to minimize Mean Absolute Error (MAE) on the validation set. We began by initializing the ensemble with the model that achieved the lowest MAE. We then iteratively evaluated the performance (based on MAE) obtained by adding one additional model to the ensemble. If the average prediction resulted in a lower overall MAE, we updated the ensemble to include that model. The process continued until no further improvement was obtained for a maximum of 3 iterations. We also tested a variant of this approach in which the maximum number of models was set to 5.
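One possible reading of this greedy search is sketched below. Several details are assumptions rather than statements from the text: candidates are visited in order of their individual validation MAE, each model is used at most once, and the "3 iterations" criterion is interpreted as stopping after 3 consecutive candidates that fail to improve the ensemble. All names (`greedy_ensemble`, `patience`, `max_models`) are hypothetical.

```python
from statistics import fmean


def mae(pred, target):
    """Mean absolute error between two equally long sequences."""
    return fmean(abs(p - t) for p, t in zip(pred, target))


def ensemble_mean(model_preds, members):
    """Per-sequence average of the predictions of the selected models."""
    n = len(model_preds[members[0]])
    return [fmean(model_preds[m][i] for m in members) for i in range(n)]


def greedy_ensemble(model_preds, target, patience=3, max_models=None):
    """Greedily grow an ensemble that lowers validation MAE.

    model_preds maps a model name to its validation predictions.
    patience and max_models are assumed interpretations of the stopping rules.
    """
    ranked = sorted(model_preds, key=lambda m: mae(model_preds[m], target))
    members = [ranked[0]]  # start from the best single model
    best = mae(model_preds[ranked[0]], target)
    misses = 0
    for candidate in ranked[1:]:
        if misses >= patience:
            break
        if max_models is not None and len(members) >= max_models:
            break
        trial = members + [candidate]
        score = mae(ensemble_mean(model_preds, trial), target)
        if score < best:
            members, best, misses = trial, score, 0  # keep the candidate
        else:
            misses += 1  # reject the candidate
    return members, best


# Hypothetical validation targets and predictions from three models.
target = [50.0, 60.0, 70.0]
model_preds = {
    "a": [52.0, 61.0, 73.0],
    "b": [47.0, 60.0, 68.0],
    "c": [60.0, 70.0, 80.0],
}
members, best = greedy_ensemble(model_preds, target)
```

In this toy case models "a" and "b" err in opposite directions, so their average beats either model alone, while adding "c" would raise the MAE and is rejected. Passing `max_models=5` would correspond to the capped variant.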

Classification-Based Ensemble
This method leverages a two-stage ensemble approach for predicting melting temperature.
Each sequence in the validation set was assigned a class label (thermophilic or non-thermophilic) using the TemBERTure CLS model. For each class, a greedy search ensemble approach was then employed to select the set of models that minimized the Mean Absolute Error (MAE) on the corresponding target melting temperature values.
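The two-stage idea can be illustrated with the sketch below, which splits the validation set by predicted class, selects models separately for each class, and then routes each sequence to the ensemble of its class. To keep the example short, a "best single model" selector stands in for the greedy search; any selector with the same signature could be plugged in. The class labels are supplied as a plain list standing in for the classifier output, and all names are hypothetical.

```python
from statistics import fmean


def mae(pred, target):
    """Mean absolute error between two equally long sequences."""
    return fmean(abs(p - t) for p, t in zip(pred, target))


def best_single_model(model_preds, target):
    """Stand-in selector (not the greedy search): a one-model ensemble."""
    return [min(model_preds, key=lambda m: mae(model_preds[m], target))]


def select_per_class(model_preds, target, labels, select=best_single_model):
    """Run the selector separately on the sequences of each predicted class."""
    chosen = {}
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        sub_preds = {m: [p[i] for i in idx] for m, p in model_preds.items()}
        sub_target = [target[i] for i in idx]
        chosen[cls] = select(sub_preds, sub_target)
    return chosen


def predict(model_preds, labels, chosen):
    """Average, for each sequence, the models chosen for its class."""
    return [
        fmean(model_preds[m][i] for m in chosen[lab])
        for i, lab in enumerate(labels)
    ]


# Hypothetical class labels (stand-in for classifier output), targets,
# and predictions from two regression models.
labels = ["non-thermophilic", "non-thermophilic", "thermophilic", "thermophilic"]
target = [45.0, 50.0, 75.0, 80.0]
model_preds = {
    "low_range": [46.0, 49.0, 65.0, 70.0],
    "high_range": [55.0, 60.0, 76.0, 79.0],
}
chosen = select_per_class(model_preds, target, labels)
final = predict(model_preds, labels, chosen)
```

Here each model is accurate in only one temperature regime, so the per-class selection picks a different model for each class, which is the situation this ensemble design is meant to exploit.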

Figure S1: Effect of fine-tuning on amino acid High Attention Score (HAS) frequency. Scatterplots comparing the high attention frequency of amino acids as identified by the pretrained protBERT-BFD 10 model (x-axis) versus the fine-tuned TemBERTure CLS model (y-axis) for non-thermophilic sequences (A) and thermophilic sequences (B). Each point represents an amino acid, with its position reflecting the change in attention frequency after fine-tuning. Amino acids located above the diagonal have gained attention in the fine-tuned model and are colored in gray.

Figure S2: Amino acid frequency and HAS frequency. The bar chart presents a dual-layered comparison: the upper segment displays the frequency of individual amino acids within the TemBERTure DB test set, while the lower segment focuses specifically on the frequency of HAS amino acids. Red bars represent the prevalence of amino acids in thermophilic proteins, and blue bars denote their occurrence in non-thermophilic proteins.

Figure S3: Comparison of TemBERTure CLS attention scores on pairs of protein homologs. Scatter plots illustrate the attention scores between thermophilic and non-thermophilic paired Protein Data Bank (PDB) structures. Each plot corresponds to a unique pair, denoted by their respective PDB IDs. Red and blue markers indicate HAS for thermophilic and non-thermophilic proteins, respectively. Diamonds and circles differentiate between conserved and non-conserved amino acids, and triangles represent insertions.

Figure S4: Mapping of TemBERTure CLS attention scores on protein structures. Set of 16 pairs of homologous non-thermophilic (left of each pair) and thermophilic (right of each pair) protein structures extracted from the Protein Data Bank. Each pair of protein structures is depicted side by side for comparative analysis. Regions with a higher attention score are thickened and colored in red.

Table S5: TemBERTure Tm hyperparameter tuning. Optimal parameters are shown in bold.

Table S6: Number of sequences in the TemBERTure DB classifier and regression datasets.

Table S7: Comparison between our TemBERTure CLS model and other state-of-the-art models on the TemBERTure DB test set.

Table S8: TemBERTure CLS generalization ability. TemBERTure CLS performance on available test sets, excluding sequences with over 50% identity to our TemBERTure DB validation and training sets. To account for the large class imbalance in the filtered dataset, we use macro-averaging for F1-score, recall, and precision.

Table S9: Ensemble performance on the regression validation set. Mean Absolute Error (MAE) measures the average absolute discrepancy between the predicted melting temperatures and the observed values. The coefficient of determination (R²) assesses how well the model captures the variability of the melting temperature.
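
For reference, the two metrics can be written as below, assuming the standard definitions (the source does not state which formulation of the coefficient of determination was used). Here y_i is the observed melting temperature of sequence i, \hat{y}_i the predicted value, \bar{y} the mean observed value, and n the number of sequences; these symbols are introduced only for this illustration.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```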